05. REINFORCE

You've learned that our goal is to find the values of the weights \theta in the neural network that maximize the expected return U

U(\theta) = \sum_\tau P(\tau;\theta)R(\tau)

where \tau is an arbitrary trajectory. One way to determine the value of \theta that maximizes this function is through gradient ascent. This algorithm is closely related to gradient descent, with two key differences:

  • gradient descent is designed to find the minimum of a function, whereas gradient ascent will find the maximum, and
  • gradient descent steps in the direction of the negative gradient, whereas gradient ascent steps in the direction of the gradient.
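
To make the contrast concrete, here is a minimal sketch (not part of the lesson) that applies both update rules to the same one-dimensional function f(x) = \sin(x); the function, starting point, step size, and iteration count are arbitrary choices for illustration.

```python
import math

def f(x):
    return math.sin(x)          # a function with both maxima and minima

def grad_f(x):
    return math.cos(x)          # analytic gradient of f

alpha = 0.1                     # step size
x_descent = x_ascent = 2.0      # same starting point for both rules

for _ in range(200):
    x_descent -= alpha * grad_f(x_descent)   # descent: step against the gradient
    x_ascent  += alpha * grad_f(x_ascent)    # ascent: step along the gradient

print(f"descent -> {x_descent:.3f}")   # ~4.712 (3*pi/2), a local minimum of sin
print(f"ascent  -> {x_ascent:.3f}")    # ~1.571 (pi/2), a local maximum of sin
```

Starting from the same point, descent settles at a nearby minimum while ascent settles at a nearby maximum; gradient ascent on U(\theta) uses the same idea, with the gradient supplied by the estimate discussed below.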

Our update step for gradient ascent appears as follows:

\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)

where \alpha is the step size, which is generally allowed to decay over time. Once we know how to calculate or estimate this gradient, we can repeatedly apply this update step, in the hopes that \theta converges to the value that maximizes U(\theta).
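
The gradient estimate used in step 2 of the pseudocode below comes from the likelihood-ratio (score-function) trick; a brief sketch of that standard derivation, which is not spelled out above, is

\nabla_\theta U(\theta) = \sum_\tau \nabla_\theta P(\tau;\theta) R(\tau) = \sum_\tau P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) R(\tau)

Because P(\tau;\theta) is a product of the policy terms \pi_\theta(a_t|s_t) and the environment's transition probabilities, which do not depend on \theta, the transition terms drop out of the gradient of the log:

\nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t|s_t)

The gradient is therefore an expectation over trajectories of \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau), and we can approximate it by averaging this quantity over trajectories sampled from the current policy. That sample average is exactly the estimate \hat{g} in step 2 below.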

## Video

(Video: M3L3 C05 V2)

## Pseudocode

The algorithm described in the video is known as REINFORCE. The pseudocode is summarized below.

  1. Use the policy \pi_\theta to collect m trajectories \{ \tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(m)} \} with horizon H. We refer to the i-th trajectory as
    \tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, s_{H+1}^{(i)}).
  2. Use the trajectories to estimate the gradient \nabla_\theta U(\theta):
    \nabla_\theta U(\theta) \approx \hat{g} := \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)})
  3. Update the weights of the policy:
    \theta \leftarrow \theta + \alpha \hat{g}
  4. Loop over steps 1-3.
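
As a concrete reference, here is a minimal sketch of these steps in PyTorch, assuming the classic Gym API (env.reset() returns an observation and env.step() returns a 4-tuple) and a CartPole environment. The PolicyNet architecture, the values of m, H, and the learning rate, and the use of a surrogate loss whose gradient is -\hat{g} are illustrative choices, not prescribed by the lesson.

```python
# Minimal REINFORCE sketch (illustrative; assumes PyTorch and the classic Gym API).
import gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Small softmax policy pi_theta(a|s)."""
    def __init__(self, state_size, action_size, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

env = gym.make("CartPole-v0")
policy = PolicyNet(env.observation_space.shape[0], env.action_space.n)
optimizer = optim.SGD(policy.parameters(), lr=1e-2)  # lr plays the role of alpha

m, H = 5, 1000  # trajectories per update and horizon (illustrative values)

for iteration in range(500):                      # step 4: loop over steps 1-3
    losses = []
    for _ in range(m):                            # step 1: collect m trajectories
        state = env.reset()
        log_probs, rewards = [], []
        for t in range(H):
            probs = policy(torch.from_numpy(state).float())
            dist = Categorical(probs)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))   # log pi_theta(a_t | s_t)
            state, reward, done, _ = env.step(action.item())
            rewards.append(reward)
            if done:
                break
        R = sum(rewards)                          # R(tau): total return of the trajectory
        # Step 2: the gradient of -sum(log pi) * R is the negative of this
        # trajectory's contribution to g_hat, so minimizing this surrogate loss
        # performs gradient *ascent* on U(theta).
        losses.append(-torch.stack(log_probs).sum() * R)

    loss = torch.stack(losses).mean()             # average over the m trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # step 3: theta <- theta + alpha * g_hat
```

Using plain SGD on the negated objective makes the optimizer step coincide exactly with step 3 above; in practice, refinements such as a different optimizer or modifications to how R(\tau) weights each log-probability are common, but they go beyond this basic version of REINFORCE.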